Ultrasound-based deep learning radiomics model for differentiating benign, borderline, and malignant ovarian tumours: a multi-class classification exploratory study

Background: Accurate preoperative identification of ovarian tumour subtypes is imperative because it enables physicians to tailor precise, individualised management strategies. We therefore developed an ultrasound (US)-based multiclass prediction algorithm for differentiating between benign, borderline, and malignant ovarian tumours.

Methods: We randomised data from 849 patients with ovarian tumours into training and testing sets in a ratio of 8:2. The regions of interest on the US images were segmented, and handcrafted radiomics features were extracted and screened. We applied the one-versus-rest method for multiclass classification. We inputted the best features into machine learning (ML) models and constructed a radiomic signature (Rad_Sig). US images of the maximum trimmed ovarian tumour sections were inputted into a pre-trained convolutional neural network (CNN) model, which generated each sample's predicted probability, termed the deep transfer learning signature (DTL_Sig). Clinical baseline data were analysed, and statistically significant clinical parameters and US semantic features in the training set were used to construct the clinical signature (Clinic_Sig). The prediction results of Rad_Sig, DTL_Sig, and Clinic_Sig for each sample were fused as a new feature set to build the combined model, namely, the deep learning radiomic signature (DLR_Sig). We used the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) to estimate the performance of the multiclass classification models.

Results: The training set included 440 benign, 44 borderline, and 196 malignant ovarian tumours. The testing set included 109 benign, 11 borderline, and 49 malignant ovarian tumours. The DLR_Sig three-class prediction model had the best overall and class-specific classification performance, with micro- and macro-average AUCs of 0.90 and 0.84, respectively, on the testing set. The class-specific AUCs were 0.84, 0.85, and 0.83 for benign, borderline, and malignant ovarian tumours, respectively. In the confusion matrix, the Clinic_Sig and Rad_Sig classifiers could not recognise borderline ovarian tumours. However, the proportions of borderline and malignant ovarian tumours identified by DLR_Sig were the highest at 54.55% and 63.27%, respectively.

Conclusions: The three-class prediction model of US-based DLR_Sig can discriminate between benign, borderline, and malignant ovarian tumours. Therefore, it may guide clinicians in determining the differential management of patients with ovarian tumours.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12880-024-01251-2.


Introduction
Ovarian tumours are of various histological types, including benign, borderline, and malignant lesions [1-3]. Benign tumours have a good prognosis and are treated conservatively with regular follow-up observations [2,4]. Epithelial hyperplasia and nuclear atypia are more prominent in borderline ovarian tumours (BOTs) than in benign ovarian tumours; however, unlike ovarian malignancies, BOTs show no stromal invasion [5]. BOTs have a good prognosis, with a 10-year survival rate of > 95% for stages I, II, and III [6]. The primary treatment for BOTs is surgical intervention; however, more than one-third of BOT cases occur in women aged under 40 years who may want to conceive in the future [1]. Therefore, prioritising fertility preservation in young women who wish to have children is crucial. Patients with malignant ovarian tumours should be referred to gynaecologic oncologists for further diagnosis and treatment, and depending on the stage of cancer, debulking surgery and chemotherapy may be considered [7]. Different types of ovarian tumours have distinct clinical and pathological characteristics, treatment strategies, and prognoses. The early detection and treatment of ovarian malignancies can improve patient outcomes [8]. Therefore, the preoperative identification of the nature of ovarian tumours is critical for patients and can guide physicians in developing individualised and precise management plans.
Ultrasonography, especially transvaginal ultrasonography, is considered the primary method for evaluating adnexal tumours [9,10]. Currently, subjective assessment by ultrasound (US) experts is a relatively good method of distinguishing the nature of ovarian tumours. However, US specialists are few, and subjective diagnoses differ among US physicians with different levels of experience [11,12]. Therefore, it is necessary to analyse, objectively, quantitatively, and reproducibly, the imaging features that may reveal the underlying biological characteristics of tumours.
Radiomics is an emerging field of quantitative imaging that can significantly impact personalised medicine. It mines quantitative features from medical images using high-throughput methods; these features are then transformed into objective, structured data through complex algorithms and applied to clinical decision support systems to improve diagnosis, prognosis assessment, and prediction accuracy [13,14]. Previous studies on computed tomography (CT)-, magnetic resonance imaging (MRI)-, and US-based radiomics for differentiating benign and malignant ovarian tumours achieved satisfactory diagnostic results [15-18]. However, radiomic features are predefined, including morphology, intensity, texture, and wavelet features, which are superficial and low-order and cannot represent the heterogeneity of the entire tumour [19,20]. Therefore, to accurately classify ovarian tumours, studying their deeper and higher-level features is necessary.
Deep learning (DL) is a branch of machine learning (ML) that allows computational models with multiple processing layers to learn data representations at multiple levels of abstraction [21]. The convolutional neural network (CNN) is the most commonly used DL architecture in medical image analysis [22]. It has been suggested that CNN-extracted features capture various high-order image features that can be related to specific clinical outcomes [20]. Successful application of DL requires large training sets. However, medical data sets are often limited in size. Many practical applications therefore use CNNs pre-trained on ImageNet, an approach known as transfer learning (TL), instead of training a DL model from scratch [23,24]. Research using deep transfer learning (DTL) to classify benign and malignant ovarian tumours has been successful [11,12,25]. However, in those studies BOTs were categorised as malignant ovarian tumours for statistical analysis. Combining DL classification networks with traditional hand-crafted radiomics frameworks is a recent development [26,27]. Few reports exist on US-based combined DL radiomics (DLR) models as multiclass prediction models for classifying ovarian tumours as benign, borderline, or malignant. We hypothesised that DLR could differentiate between benign, borderline, and malignant ovarian tumours. Hence, this study aimed to develop a US-based DLR model to identify benign, borderline, and malignant ovarian lesions.

Study design and participants
We enrolled 849 patients with ovarian tumours confirmed by histopathological examination after surgical removal from July 2014 to October 2022. The inclusion criteria were: (a) complete US examination within 1 month before surgery and (b) a clear and definite US image of the target lesion. The exclusion criteria were: (a) poor image quality, (b) absent or incomplete US and clinical data, (c) pregnancy, (d) history of tumours in other parts of the body and ovarian metastatic cancer, (e) previous treatment before US examination or surgery, and (f) pathological diagnosis obtained through biopsy and uncertain pathology results. A flowchart of the participants is shown in Fig. 1.
The study population was labelled according to the pathological results, with benign ovarian tumours labelled as "class 0", BOTs as "class 1", and malignant ovarian tumours as "class 2". Participant data were randomised into training and testing sets in a ratio of 8:2 using Python's statistical package. The random partitioning was stratified to handle the class imbalance; hence, the proportions of patients with benign, borderline, and malignant ovarian tumours were similar in the total study population, the training set, and the testing set. No data overlap occurred between the training and testing sets, which avoided the repeated use of data from the same patient [28].
The training set was used to learn the parameters and build the model, whereas the testing set was used to evaluate the generalisability of the selected model and to check for overfitting.
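The stratified 8:2 partition described above can be sketched as follows. This is a minimal illustration and not the authors' code: the text only mentions "Python's statistical package", so the function name `stratified_split`, the seed, and the choice to round the per-class test count down are all assumptions. With the reported class sizes (549, 55, and 245), this rounding happens to reproduce the reported set sizes.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_pct=20, seed=42):
    """Split sample indices into train/test so that each class keeps its share.

    Hypothetical helper: the per-class test count is rounded down (assumption).
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = len(members) * test_pct // 100  # integer arithmetic, rounds down
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx


# Class sizes reported in the study: 0 = benign, 1 = borderline, 2 = malignant.
labels = [0] * 549 + [1] * 55 + [2] * 245
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(dict(train_counts), dict(test_counts))
```

Because every index is assigned to exactly one of the two lists, no sample can appear in both sets. With several images per patient, the split would have to be done per patient rather than per image to keep that guarantee.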

Ultrasound data acquisition
All participants underwent transvaginal ultrasonography whenever possible. If a mass was too large to be fully displayed on transvaginal ultrasonography, it could be supplemented with a transabdominal US. [12,29,30].

The specialized assessment of ultrasound images
Initially, the ultrasound image was assessed by Doctor A, a seasoned gynecology and obstetrics ultrasound specialist with ten years of professional experience, who provided the initial diagnosis. Subsequently, Doctor B, another gynecology and obstetrics ultrasound expert with over 15 years of experience, confirmed the diagnosis. In cases of discordant opinions, a senior expert in gynecology and obstetrics ultrasound with more than two decades of experience was consulted, leading to a consensus through collaborative discussion. These doctors were unaware of the patient's clinical and biochemical indicators or pathological results.

Image pre-processing and regions of interest (ROI) segmentation
The grey-level ranges of two-dimensional images obtained using different US devices vary significantly, and the voxel spacing of images obtained using different US

Hand-crafted radiomics feature extraction and selection
We employed PyRadiomics (http://pyradiomics.readthedocs.io) to extract the handcrafted radiomic features. Subsequently, Z-score normalisation was performed to eliminate differences in the value scales of the extracted features.
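Z-score normalisation maps each feature to zero mean and unit standard deviation. The sketch below illustrates the idea on a hypothetical feature matrix. Fitting the mean and standard deviation on the training set only and reusing them for the testing set is an assumption made here to avoid information leakage; the text does not state how the statistics were computed.

```python
import numpy as np


def fit_zscore(train_features):
    """Return the per-feature mean and standard deviation of the training data."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return mean, std


def apply_zscore(features, mean, std):
    """Standardise features with previously fitted statistics."""
    return (features - mean) / std


rng = np.random.default_rng(0)
# Hypothetical data: three features on very different value scales.
train = rng.normal(loc=[0.5, 200.0, 1e4], scale=[0.1, 30.0, 2e3], size=(8, 3))
test = rng.normal(loc=[0.5, 200.0, 1e4], scale=[0.1, 30.0, 2e3], size=(2, 3))

mean, std = fit_zscore(train)
train_z = apply_zscore(train, mean, std)
test_z = apply_zscore(test, mean, std)  # reuses the training statistics
```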
A total of 1476 handcrafted radiomics features were extracted from tasks 1 and 2, including the first-order features, shape features, gray-level dependence matrix (GLDM), gray-level size zone matrix (GLSZM), gray-level run length matrix (GLRLM), and gray-level co-occurrence matrix (GLCM). The number and proportion of handcrafted radiomics features are presented in Fig. 2. The P-values for all handcrafted features are shown in Fig. 3.
First, we retained hand-crafted radiomic features with an intra-/inter-class correlation coefficient > 0.8 to ensure the robustness and repeatability of these features. Next, only the 1,444 features with P < 0.05 in a t-test or Mann-Whitney U test were retained. Subsequently, Spearman correlation analysis was used to calculate the correlation between features. When the correlation coefficient between any two features exceeded 0.9, only one of the two was retained; using a greedy recursive deletion strategy that keeps the features strongly correlated with the prediction target, 295 features were retained. Finally, the least absolute shrinkage and selection operator (LASSO) regression algorithm was used for feature selection. Depending on the regularisation weight λ, LASSO shrinks all regression coefficients towards zero and sets the coefficients of irrelevant features exactly to zero. We employed 10-fold cross-validation with the minimum-error criterion to determine the optimal λ; the final value of λ (0.016768) yielded the minimum cross-validation error. We retained the 53 features with non-zero coefficients as the optimal features.
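The correlation-based pruning step can be illustrated with a small sketch. It is a simple greedy variant and not necessarily the authors' exact deletion procedure: features are visited from most to least relevant, and a feature is kept only if its absolute Spearman correlation with every already kept feature is at most 0.9. The `relevance` scores are hypothetical (for example, the absolute correlation of each feature with the label), and the rank transform assumes continuous features without ties.

```python
import numpy as np


def rank_columns(x):
    """Rank each column (0..n-1); assumes continuous values without ties."""
    return np.argsort(np.argsort(x, axis=0), axis=0).astype(float)


def prune_correlated(features, relevance, threshold=0.9):
    """Greedily keep the most relevant features and skip redundant ones.

    A feature is skipped if its absolute Spearman correlation with any
    already kept feature exceeds the threshold.
    """
    # Spearman correlation equals the Pearson correlation of the ranks.
    rho = np.abs(np.corrcoef(rank_columns(features), rowvar=False))
    kept = []
    for j in np.argsort(-relevance):  # most relevant first
        if all(rho[j, k] <= threshold for k in kept):
            kept.append(int(j))
    return sorted(kept)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Column 1 is a near-copy of column 0; column 2 is independent of both.
features = np.column_stack([a, a + 0.01 * rng.normal(size=200), b])
relevance = np.array([0.9, 0.5, 0.7])  # hypothetical relevance scores
kept = prune_correlated(features, relevance)
print(kept)
```

In this toy case the near-duplicate column is dropped and the two informative columns survive, which is the behaviour the 0.9 threshold is meant to achieve before LASSO is applied.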

The deep transfer learning procedure
We used DTL, that is, a CNN model pre-trained on the ImageNet dataset, to limit overfitting owing to the small size of the training dataset.
Data augmentation is often required to improve DTL's prediction performance and generalisation ability in image classification because of imbalanced or insufficient data. Hence, we utilised horizontal flipping and random cropping for data augmentation, which helped increase the sample size and enhance the model performance.
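The two augmentations named above can be sketched with plain array operations. This is an illustration under stated assumptions: the 256 × 256 input size, the 224 × 224 crop size, and the flip probability of 0.5 are hypothetical values, and the actual study would have applied the transforms inside its DL framework.

```python
import numpy as np


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]


def horizontal_flip(image, rng, p=0.5):
    """Mirror the image left to right with probability p."""
    return image[:, ::-1] if rng.random() < p else image


def augment(image, crop_h, crop_w, rng):
    """Random crop followed by a random horizontal flip."""
    return horizontal_flip(random_crop(image, crop_h, crop_w, rng), rng)


rng = np.random.default_rng(0)
# Stand-in for a greyscale US image (hypothetical size).
image = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
augmented = [augment(image, 224, 224, rng) for _ in range(4)]
```

Each call yields a slightly different view of the same lesion, so the network sees more varied inputs per epoch without new patients being added.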
To improve generalisation, we set the learning rate carefully, adopting a cosine-decay learning rate schedule. The learning rates are presented in Additional file 1.
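A cosine-decay schedule lowers the learning rate smoothly from a maximum to a minimum along half a cosine period. The sketch below uses the common form lr(t) = lr_min + 0.5 (lr_max - lr_min)(1 + cos(πt/T)); the values of `lr_max`, `lr_min`, and the number of steps are hypothetical, since the actual settings are given only in Additional file 1.

```python
import math


def cosine_decay_lr(step, total_steps, lr_max=0.01, lr_min=0.0):
    """Learning rate at a given step under cosine decay (hypothetical values)."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


schedule = [cosine_decay_lr(s, total_steps=100) for s in range(101)]
print(schedule[0], schedule[50], schedule[100])
```

The rate changes slowly at the start and end of training and fastest in the middle, which avoids the abrupt drops of a step schedule.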

Signature building
The baseline clinical data were analysed in the training set. Clinical parameters and ultrasonic semantic features with P < 0.05 were selected, and Spearman correlation analysis was used to assess the correlation between these parameters. Parameters without a significant correlation were inputted into the support vector machine model to build the clinical signature (Clinic_Sig).

Model assessment
In this study, we employed a one-versus-rest method, which is often applied in multiclass classification. We evaluated the models' performance based on receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). We used precision, recall, F1 score, and the macro-, micro-, and weighted averages to assess one-versus-rest discrimination for each class of ovarian tumour and overall. A confusion matrix was used to analyse the errors of each model.
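The one-versus-rest metrics can be made concrete with a small sketch. For each class, the samples of that class are treated as positives and all others as negatives; the macro-average is the unweighted mean of the per-class AUCs, and the micro-average pools all (sample, class) pairs into one binary problem. The labels and probabilities below are hypothetical, and the pairwise AUC computation is a simple illustrative choice rather than the study's implementation.

```python
import numpy as np


def binary_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def ovr_auc(labels, probs):
    """Per-class, macro-average, and micro-average one-versus-rest AUC."""
    n_classes = probs.shape[1]
    onehot = np.eye(n_classes, dtype=int)[labels]
    per_class = [binary_auc(onehot[:, c], probs[:, c]) for c in range(n_classes)]
    macro = float(np.mean(per_class))
    micro = binary_auc(onehot.ravel(), probs.ravel())
    return per_class, macro, micro


def precision_recall_f1(labels, preds, cls):
    """One-versus-rest precision, recall, and F1 for a single class."""
    tp = int(np.sum((preds == cls) & (labels == cls)))
    fp = int(np.sum((preds == cls) & (labels != cls)))
    fn = int(np.sum((preds != cls) & (labels == cls)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical data: 0 = benign, 1 = borderline, 2 = malignant.
labels = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.3, 0.5, 0.2],
    [0.4, 0.2, 0.4],
    [0.1, 0.3, 0.6],
    [0.2, 0.3, 0.5],
])
per_class, macro, micro = ovr_auc(labels, probs)
preds = probs.argmax(axis=1)
print(per_class, macro, micro, precision_recall_f1(labels, preds, 1))
```

A model that never predicts a given class yields zero precision, recall, and F1 for it under these definitions, even though its AUC for that class can still be above 0.5.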

Statistical analysis
Statistical analysis was performed using Python (https://www.python.org/). Normally distributed variables are reported as mean ± standard deviation, whereas non-normally distributed variables are reported as median (interquartile range). Categorical variables are expressed as frequencies (percentages). One-way analysis of variance was used to compare the three groups for variables meeting the normality and homogeneity criteria, and a rank-sum nonparametric test for multiple independent samples was adopted for variables that did not meet them. Categorical data were analysed using the chi-square

Patient characteristics
We included 849 patients in this study. Among them, 549 (64.66%), 55 (6.48%), and 245 (28.86%) had benign, borderline, and malignant ovarian tumours, respectively. The proportions of benign, borderline, and malignant ovarian tumours in the entire study group, training set, and testing set were approximately the same. The baseline characteristics are shown in Table 1.

Ultrasound expert assessment of benign, borderline, and malignant ovarian tumours
Ultrasound specialists identified benign and malignant ovarian tumours with high accuracy, at rates of 95.80% and 82.80%, respectively. Conversely, the accuracy in identifying borderline ovarian tumours was notably lower at 34.50% (Table 2).

The confusion matrix of the three-class classification prediction model
We used the confusion matrix to understand where the classifier models made classification errors and in what proportions (Fig. 4; Table 3). The four multiclass classification prediction models correctly identified benign ovarian tumours at high rates (89.91%, 88.99%, 86.24%, and 82.57%, respectively). Clinic_Sig and Rad_Sig showed relatively poor accuracy in identifying malignant ovarian tumours (16.33% and 38.78%, respectively). The Clinic_Sig and Rad_Sig classifiers could not recognise BOTs. The proportion of BOTs identified by DLR_Sig was the highest at 54.55%.
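The percentages in a confusion matrix are row-wise shares: the number of samples of a true class assigned to each predicted class, divided by the size of that true class. The sketch below shows the computation on hypothetical toy labels and then checks two reported figures. The testing-set class sizes (11 borderline, 49 malignant) come from the study, whereas the correct counts of 6 and 31 are inferred here from the reported percentages and are not taken from Table 3.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def row_percentages(matrix):
    """Express each row as percentages of that true class."""
    return [
        [round(100.0 * c / sum(row), 2) if sum(row) else 0.0 for c in row]
        for row in matrix
    ]


def class_recall(correct, total):
    """Share of a true class that was labelled correctly, in percent."""
    return round(100.0 * correct / total, 2)


# Hypothetical toy labels: 0 = benign, 1 = borderline, 2 = malignant.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 2, 1, 0, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)
cm_pct = row_percentages(cm)

# Inferred counts (assumption) against the reported testing-set class sizes.
bot_recall = class_recall(6, 11)
malignant_recall = class_recall(31, 49)
print(cm, cm_pct, bot_recall, malignant_recall)
```

With only 11 borderline cases in the testing set, each additional correct prediction changes the borderline rate by about nine percentage points, so these rates should be read with that granularity in mind.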

Classification performance
The DLR_Sig three-class prediction model had the best overall and class-specific classification performance, with micro- and macro-average AUCs of 0.90 and 0.84, respectively, on the testing set. The class-specific AUCs were 0.84 for benign, 0.85 for borderline, and 0.83 for malignant ovarian tumours (Fig. 5; Table 4).

Application of Grad-CAM
Grad-CAM, which produces a coarse localisation map highlighting the regions that are critical for the classification target, has been proposed as a method for visualising the decisions of CNN models. The red areas of the heat map are the regions on which the model relies most when making its decision [31]. The site of concern for US diagnosis was consistent with the area of concern for CNN decision-making (Fig. 6).
In Fig. 6a, b, and c, there was a solid hypoechoic mass, 100 mm in diameter, in one patient's pelvis. Rich blood flow signals were observed in and around the mass, and the CA125 level was 8.67 U/ml. A US expert suggested that the patient had a malignant ovarian tumour. However, DTL_Sig predicted a benign lesion with a probability of 97.35%. Pathological results showed that it was a benign theca cell tumour. The prediction of DTL_Sig was consistent with the pathological diagnosis.
In Fig. 6d, e, and f, there was a mixed cystic-solid mass in a patient's pelvis, which was 112 mm in diameter, and the CA125 level was 206 U/ml. A US expert suggested that the patient had a malignant ovarian tumour. However, DTL_Sig indicated a BOT with a probability of 85.98%. A pathological diagnosis of BOT was made. The prediction of DTL_Sig was consistent with the pathological diagnosis.
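The heat maps discussed in this section follow the standard Grad-CAM recipe: the gradients of the target class score with respect to the last convolutional feature maps are averaged over space to give one weight per channel, the feature maps are combined with these weights, and a ReLU keeps only the regions that support the class. The sketch below applies that recipe to hypothetical arrays; in practice, `activations` and `gradients` would be captured from the trained CNN with framework hooks, which is outside the scope of this illustration.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Coarse localisation map from feature maps and their gradients.

    Both inputs have shape (channels, height, width) and refer to the last
    convolutional layer for one image and one target class.
    """
    weights = gradients.mean(axis=(1, 2))  # global average pooling of gradients
    cam = np.tensordot(weights, activations, axes=(0, 0))  # weighted channel sum
    cam = np.maximum(cam, 0.0)  # ReLU: keep evidence in favour of the class
    if cam.max() > 0:
        cam = cam / cam.max()  # scale to [0, 1] for display as a heat map
    return cam


rng = np.random.default_rng(0)
activations = rng.random((4, 7, 7))       # hypothetical feature maps
gradients = rng.normal(size=(4, 7, 7))    # hypothetical gradients
heatmap = grad_cam(activations, gradients)
print(heatmap.shape)
```

The resulting map has the low spatial resolution of the last convolutional layer and is upsampled to the image size before being overlaid on the US image.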

Discussion
The accurate prediction of the category of ovarian tumours is critical for patient-centred care. Studies applying DLR to the multiclass classification of ovarian tumours are relatively scarce. In this study, we constructed four multiclass prediction models to classify benign, borderline, and malignant ovarian tumours. We found that the DLR prediction model had the best ability to classify ovarian tumours and generalised best to the testing set.
Ultrasonography is the primary method for screening ovarian tumours. Serum tumour markers are essential for detecting and treating ovarian cancer, and CA125 is the most important biomarker for evaluating ovarian cancer [32]. Inflammation plays a vital role in the development and progression of ovarian cancer [33]. Therefore, we collected US semantic features, serum tumour markers, and related inflammatory factors from the study population. These US semantic features and clinical parameters are typically obtained during routine examinations and place no additional burden on the patient. We selected some semantic features, serum tumour markers, and related inflammatory factors to construct Clinic_Sig. The Clinic_Sig three-class prediction model had poor overall and class-specific classification performance and could not predict BOTs; the precision, recall, and F1 scores for BOTs were all zero.
US examinations are subjective. US experts have higher diagnostic accuracy than less experienced doctors; however, US experts are few [11]. Recently, radiomics has become a powerful new method for quantifying features from medical images, including potential pathophysiological information about the cancer tissue [34]. Some studies have used MRI/CT/US-based radiomics to differentiate between benign and malignant ovarian tumours with high diagnostic performance [15,35,36]. However, these studies did not mention the classification of BOTs. Qi et al. [16] used radiomics models to discriminate between benign, borderline, and malignant serous ovarian tumours and provided preoperative diagnostic information to differentiate the nature of ovarian tumours. However, this was a binary classification study. In our research, the Rad_Sig three-class prediction model could not predict BOTs, and the precision, recall, and F1 scores for BOTs were all zero.

DL is becoming increasingly essential for image pattern recognition [21]. Considering the limited scale of medical datasets, we used TL instead of training a DL model from scratch. TL is beneficial because it improves the performance of a model built on small samples by utilising the knowledge learned in similar classification tasks [28]. Gao et al. [25] and Christiansen et al. [11] developed DTL models to identify benign and malignant ovarian tumours, with performance equivalent to the diagnostic level of a US specialist. Chen et al. [12] developed DTL algorithms to distinguish malignant from benign ovarian tumours, with performance comparable to expert subjective assessment and to ovarian-adnexal reporting and data system assessment. However, they classified BOTs as malignant ovarian tumours for statistical analysis. We used a ResNet50 model pre-trained on ImageNet [11,37]. The DTL_Sig three-class prediction model had good overall and class-specific classification performance, with micro- and macro-average AUCs of 0.89 and 0.85, respectively, on the testing set. The class-specific AUCs were 0.87 for benign, 0.82 for borderline, and 0.84 for malignant ovarian tumours. Although DTL performs well in various classification prediction tasks, it is a black-box algorithm that lacks interpretability, which restricts its application [31,38]. Grad-CAM is employed as a method of depicting the decision-making of DL. In our study, as shown in Fig. 6, the site of concern for US experts making the diagnosis was consistent with the area of concern for CNN decision-making revealed by Grad-CAM, and the DTL_Sig predictions were consistent with the pathological diagnoses.

Table 1 Training and testing sets of clinical parameters and semantic features of ultrasound
The combination of traditional hand-crafted radiomics and DTL algorithms, namely DLR, can effectively improve the accuracy and reliability of model predictions and is currently a popular topic in ML for tumour research. Many studies [20,38-40] have shown that the DLR model has better predictive efficacy than Rad_Sig or DTL_Sig alone. Data from traditional radiomics and DTL can be fused at the feature level or at the decision level; feature-level fusion often leads to overfitting because of the large number of features [38]. We constructed a combined model on the training set by fusing the predicted probabilities of Clinic_Sig, Rad_Sig, and DTL_Sig for each sample. The combined three-class prediction model, DLR_Sig, had the best overall and class-specific classification performance, with micro- and macro-average AUCs of 0.90 and 0.84, respectively, on the testing set. The class-specific AUCs were 0.84 for benign, 0.85 for borderline, and 0.83 for malignant ovarian tumours. For BOTs, the combined model performed best, with the highest class-specific AUC, precision, recall, F1 score, and accuracy (0.85, 42.86%, 54.55%, 57.14%, and 93.31%, respectively). The proportion of BOTs correctly identified by DLR_Sig (54.55%) exceeded that of the ultrasound experts (34.50%).
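Decision-level fusion, as described above, uses each model's predicted class probabilities as the input features of the combined model. The sketch below shows only how such a fused feature set could be assembled. The probability values and the column names are hypothetical, and the classifier trained on the fused features is not specified in this section, so none is shown.

```python
import numpy as np


def fuse_probabilities(*signature_probs):
    """Concatenate per-sample class probabilities from several models column-wise."""
    return np.hstack(signature_probs)


# Hypothetical predicted probabilities (benign, borderline, malignant) for two samples.
clinic_sig = np.array([[0.80, 0.10, 0.10], [0.50, 0.10, 0.40]])
rad_sig = np.array([[0.70, 0.05, 0.25], [0.30, 0.10, 0.60]])
dtl_sig = np.array([[0.90, 0.05, 0.05], [0.10, 0.60, 0.30]])

fused = fuse_probabilities(clinic_sig, rad_sig, dtl_sig)
columns = [
    f"{name}_{cls}"
    for name in ("Clinic_Sig", "Rad_Sig", "DTL_Sig")
    for cls in ("benign", "borderline", "malignant")
]
print(fused.shape, columns)
```

Three models with three classes each give only nine fused features per sample, which is why decision-level fusion is less prone to overfitting than concatenating hundreds of raw radiomic and deep features.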
This study had limitations. First, this was a retrospective single-centre study with a small sample size. Larger prospective and multicentre studies are required to evaluate the applicability of the predictive models in clinical practice. Second, owing to the strict inclusion and exclusion criteria for data in this study, bias could have been introduced in the model's training. Third, in this study, we extracted features from two-dimensional US images. In future studies, we will include other modalities such as colour Doppler flow imaging, spectral Doppler imaging, and contrast-enhanced US to provide more predictive information. Lastly, ROI delineation and cropping of the top section of the tumour represented only one slice of the lesion and could not describe the heterogeneity of the entire tumour. In the future, we plan to store dynamic images of the whole tumour and input them into ML to obtain more comprehensive information.

Abbreviations (Table 1): BOTs: borderline ovarian tumours; BMI: body mass index; RBC: red blood cell; WBC: white blood cell; NLR: neutrophil-to-lymphocyte ratio; PLR: platelet-to-lymphocyte ratio; LMR: lymphocyte-to-monocyte ratio; dNLR: derived neutrophil-to-lymphocyte ratio; SII: systemic immune-inflammation index; CA125: carbohydrate antigen 125; *: P<0.05

Conclusion
We developed a combined multiclass classification model that integrated clinical and traditional radiomics information with DTL at the decision level to discriminate the nature of ovarian tumours. The performance and generalisation of this model support its feasibility for distinguishing between benign, borderline, and malignant ovarian tumours.

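$\mathrm{LMR} = \dfrac{L\ (10^{9}/\mathrm{L})}{M\ (10^{9}/\mathrm{L})}$, $\mathrm{SII} = \dfrac{N\ (10^{9}/\mathrm{L}) \times P\ (10^{9}/\mathrm{L})}{L\ (10^{9}/\mathrm{L})}$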
Fig. 1 Inclusion and exclusion criteria for patients with ovarian tumours for the training and testing sets. Abbreviation: BOTs = borderline ovarian tumours

Table 2
Expert assessment of the ultrasound images

Table 3
The error analysis of the three-class classification prediction model

Table 4
Overall and class-specific classification performance